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G\ ' 

Natural gradient descent is a principled method for adapting the parameters of a statistical 
model on-line using an underlying Riemannian parameter space to redefine the direction of steepest 
descent. The algorithm is examined via methods of statistical physics which accurately characterize 
both transient and asymptotic behavior. A solution of the learning dynamics is obtained for the case 
of multilayer neural network training in the limit of large input dimension. We find that natural 
gradient learning leads to optimal asymptotic performance and outperforms gradient descent in 
the transient, significantly shortening or even removing plateaus in the transient generalization 
<—j ■ performance which typically hamper gradient descent training. 
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I. INTRODUCTION 

One of the most popular forms of neural network training is on-line learning, in which training examples (input- 
output pairs) are presented sequentially and independently at each learning iteration (for an overview of on-line 
learning in neural networks, see [1]). Natural gradient descent (NGD) was recently proposed by Amari as a principled 
alternative to standard on-line gradient descent (GD) [2]. When learning the parameters of a statistical model, in 
our case a feedforward neural network, this algorithm has the desirable properties of asymptotic optimality, given a 
I ' realizable learning problem and differentiable model, and invariance to reparametrizations of our model distribution. 
J> , NGD is already established as a popular on-line algorithm for Independent Component Analysis [2] and shows much 
£N| promise for other statistical learning problems. Yang and Amari recently introduced an NGD algorithm for training a 
multilayer perceptron [3,4]. In this paper we provide an analysis of NGD for this problem using a statistical mechanics 
formalism. Our results indicate that NGD provides significantly improved performance over GD and we quantify these 
gains for both the transient and asymptotic stages of learning (preliminary results from this work have been reported 
in [5])^ 

The intuition behind NGD comes from viewing the parameter space of a statistical model as a Riemannian space. 
A natural measure of infinitesimal distance between probability distributions is given by the Kullback-Leibler diver- 
gence [6]. In this case the Fisher information matrix can be shown to be the appropriate Riemannian metric. The 
natural gradient direction is defined as the direction of steepest descent under this metric and is obtained by pre- 
i ■ multiplying the standard Euclidean error gradient with the inverse of the Fisher information matrix. Since this metric 
is derived from a divergence between neighboring model distributions, the algorithm is clearly independent of model 
parametrization. An additional beneficial feature of using this matrix pre- multiplier is that it remains positive-definite 
and therefore ensures convergence to a minimum of the generalization error (assuming the learning rate is annealed 
appropriately). This is to be contrasted with other variable-metric algorithms which utilize the inverse averaged 
Hessian matrix. Pre-multiplying the error gradient with the inverse Hessian may make other fixed points stable, so 
that the algorithm could converge to maxima or saddle points on the mean error surface. Although such methods 
can be adapted to ensure a positive-definite matrix pre-multiplier, such adaptations are rather ad-hoc in nature and 
are not theoretically well motivated outside of the asymptotic regime. 

Variable-metric methods are often difficult to implement as on-line algorithms since they require the averaging and 
inversion of a large matrix. In the case of NGD we require knowledge of the input distribution in order to calculate 
the Fisher information matrix. Yang and Amari discuss methods for pre-processing training examples in order to 
obtain a whitened Gaussian process for the inputs [4]. If this is possible then, when the input dimension N is large 
compared to the number of hidden units K, inversion of the Fisher information for two-layer feedforward networks 
requires only 0(N 2 ) operations, providing an efficient and practical algorithm in many cases. Such a simplification is 
not possible for Hessian based methods, because the Hessian involves an average over input-output pairs. In general it 
will not be possible to apply this pre-processing because the input distribution may be far from Gaussian and difficult 
to estimate. In this case other on-line methods will be required in order to approximate the NGD algorithm. We 
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have recently proposed a method based on a matrix momentum algorithm [7] which allows efficient on-line inversion 
and averaging of the Fisher information matrix. This algorithm can be shown to approximate NGD closely and also 
provides optimal asymptotic performance, although at the cost of introducing an extra variable parameter [8]. 

Here, we will consider the idealized situation in which we have the Fisher information matrix at our disposal. Wc 
solve the averaged dynamics of NGD using a statistical mechanics framework which becomes exact as N — > oo for 
finite K (see, for example, [9-16]). This allows us to compare performance with standard GD in both the transient 
and asymptotic phases of learning, so that we can quantify the advantage that NGD can be expected to provide. 
Numerical results for a small network provide evidence of improved performance. In order to obtain more generic 
results we introduce a site-symmetric ansatz for the special case of a realizable learning scenario, so that we can 
efficiently explore a broad range of task complexity and non-linearity. We show that trapping time in an unstable 
fixed point which dominates the training time, the symmetric phase, is significantly reduced by using NGD and exhibits 
a slower power law increase as task complexity grows. We also find that asymptotic performance is greatly improved, 
with the generalization performance of NGD equalling the known universal asymptotics for batch learning [17]. 



II. NATURAL GRADIENT DESCENT 

We consider a probabilistic model pj (CIO f° r the distribution of a scalar output ( given a vector of inputs £ € R N 
which is parameterized by J e M. KN . The Kullback-Leibler divergence provides an appropriate measure for the 
distance between distributions [6] and for two nearby points in parameter space we find, 

KL( Pj (c,Olbj + dj(C,£)) = fa p(£) fap^ lo ^(pj^§^) 

~ dJ T GdJ , (1) 

where G is the Fisher information matrix, 

G(J) = fa p(t) fapj({\t) (V J logp J (C|0)(Vjlogp J (C|0) T ■ (2) 

This matrix provides a Riemannian metric within the space of model parameters. We choose the training error 
e j(C> £) — l°gPj(CI£)- The direction of steepest descent within this Riemannian space in terms of expected error is 
obtained by pre- multiplying the mean Euclidean error gradient with G 1 [2]. 

In an on-line learning scheme we draw inputs sequentially £ M : /i — 1, 2, . . . from some distribution p(£) each labelled 
according to some stochastic rule £>b(C m I£ M )- The NGD algorithm is defined by a corresponding sequence of weight 
updates, 

J M+1 = J^ + ^G-VjlogpjCCir) • (3) 

where the learning rate is scaled by the input dimension for convenience. This algorithm therefore utilizes an unbiased 
estimate of the steepest descent direction in our Riemannian parameter space. If the rule can be realized by the model 
and exemplars are corrupted by output noise then annealing the learning rate as r\ = l/a at late times (where a = n/N) 
results in optimal asymptotic performance in terms of the quadratic estimation error, saturating the Cramer-Rao lower 
bound and equalling in performance even the best batch algorithms [2]. 

Consider the deterministic mapping </>j(£) = ^ i=1 g (Jj £) which defines a soft committee machine (we call this 
the student network), where g(x) is some sigmoid activation function for the hidden units, J = {Ji}i<i<K is the set 
of input to hidden weights and the hidden to output weights are set to one. We choose the following Gaussian noise 
model, 

PJ(CI€) = 7= 5= CX P 7Tj2 • ( 4 ) 



The Fisher information matrix for this model distribution is given by G = A/cr^ with A in block form, 

a«= [dtp(a)g'(jjog'(J]oa T ■ (5) 



A particularly convenient choice for activation function is g{x) = erf(x/v / 2) as this allows the average over inputs to 
be carried out analytically for an isotropic Gaussian input distribution p{£) = W(0, 1), 
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((1 + Qjj)JiJj + (1 + Q^JjJj - Q«(JiJj + J, J?)) 



(6) 



where Ay = (1 + Qu)(l + Qjj) - Q 2 } and Q tj = JjJj. 



III. STATISTICAL MECHANICS FRAMEWORK 



In order to analyse NGD beyond the asymptotic regime we use a statistical mechanics description of the learning 
process which is exact in the limit of large input dimension TV and provides an accurate model of mean behaviour 
for realistic values of N [9,10]. We consider the case where outputs are generated by a teacher network corrupted by 
Gaussian noise, 

where <j) B (^) = E„=i5( B nO- Due to thc flexibility of this mapping [18] we can represent a variety of learning 
scenarios within this framework. The weight update at each iteration of NGD is then given by, 

jf +1 = jr + ^E^A^, (8) 

where 6? = g'iJj^) [0b (£ M ) — 4>J (£ M ) + P' 1 ] an d p 11 is zero-mean Gaussian noise of variance a 2 . Notice that knowledge 
of the noise variance is not required to execute this algorithm since the contributions from the Fisher information 
matrix and log- likelihood cancel (recall Eq. (3)). The model noise variance is therefore not included as a variable 
parameter and the algorithm is well defined even in the deterministic case where a 2 ^ — ► 0. 

The Fisher information matrix can be inverted using the partitioning method described in [4] (see appendix A); 
each block is some additive combination of thc identity matrix and outer products of the student weight vectors, 

A^ = M + £ejf ; j fe jT. ( 9 ) 

kl 

where 9ij are scalars while Q v are K dimensional square matrices. Using the methods described in [9] it is then 
straightforward to derive equations of motion for a set of order parameters = Qij, J^B„ = R in and B^B m = 

T nm , measuring the various overlaps between student and teacher vectors. These order parameters are necessary and 
sufficient to determine the generalization error e g = (^(4>j(£) — 0b(O) 2 ){, which we defined to be the expected error 
in the absence of noise [9] . The equations of motion are in the form of coupled first order differential equations for 
the order parameters with respect to the normalized number of examples, 



dRi n 

da 
da 



= r)f in (R,Q,T) , 

= V g lk (R,Q,T) + v 2 h lk {R,Q,T,<7 2 ) . (10) 



where R = [Ri n ], Q = [Qik] and T = [T nm \. The explicit expressions are given in appendix B. These equations can 
be integrated numerically in order to determine the evolution of the generalization error. 



IV. NUMERICAL RESULTS 



In Fig. 1 we show an example of the NGD dynamics for a realizable and noiseless learning scenario (K = M = 2, 
T nm = S nm )- Fig. 1(a) shows the evolution of the generalization error while Figs. 1(b) and (c) show the student- 
teacher and student-student overlaps respectively (the indices have been re-ordered a posteriori) . We have used initial 
conditions corresponding to an input dimension of about N ~ 10 6 , although we expect the dynamical equations to 
describe mean behavior accurately for much smaller systems as was found to be the case for GD [10]. The dashed 
line in Fig. 1(a) shows the effect of reducing the initial conditions for each R in and Qi^k by a factor of 10~ 3 , which 
corresponds to an input dimension of about TV ~ 10 12 . The symmetric phase seems to grow logarithmically as N 
increases, as was also found to be the case for GD [12]. 
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As is the case for GD [9] the dynamics for this example can be characterized by two major phases of learning, the 
symmetric phase and asymptotic convergence. Following a short initial transient the order parameters are trapped 
in a subspace characterized by a lack of differentiation between the activities of different teacher nodes. After an 
initial reduction, the generalization error remains at a constant non-zero value and the student-teacher overlaps are 
virtually indistinguishable. This symmetric phase is an unstable fixed point of the dynamics and eventually small 
perturbations due to the random initial conditions lead to escape and convergence towards zero generalization error. 
If the teacher is deterministic, as in this example, then the generalization error converges to zero exponentially unless 
the learning rate is chosen too large (see the inset to Fig. 1(a)). If the teacher's output is corrupted by noise then 
the learning rate must be annealed in order for the generalization error to decay asymptotically (we will consider this 
regime in more detail in section VB). 

The dynamics differs from the GD result in that the symmetric phase is typically less pronounced, although the 
dashed line in Fig. 1(a) shows how the symmetric phase increases in duration as N increases (because of a reduced 
asymmetry in the initial conditions). The dynamics for GD and NGD are qualitatively different for small learning 
rates, where fluctuations in the gradient are completely suppressed and the i] 2 terms in Eq. (10) can be neglected. In 
this limit the symmetric phase disappears completely for NGD, while it still dominates the learning time for GD. The 
symmetric phase is a fluctuation driven phenomena for NGD, rather than a perturbation around the deterministic 
result. As described in the next section, this makes analysis of the symmetric phase more difficult than for GD since 
a small learning rate expansion is no longer meaningful. 

A quantitative comparison of GD and NGD is difficult because both algorithms have a free parameter, the learning 
rate 77, which can be chosen arbitrarily and which will be critical to performance. In order to make a principled 
comparison we choose to compare the algorithms when their learning rates are chosen to be optimal. This can be 
achieved by using a variational method which allows us to determine the globally optimal time-dependent learning 
rate for each algorithm [14]. The resulting learning rate optimizes the total change in generalization error over a fixed 
time-window and is found by extremitization of the following functional (see [14] for details and results for optimized 



Numerical results suggest that the optimal learning rates determined here are close to the critical learning rate within 
the symmetric phase for both methods, above which the student weight vector norms increase without bound. 

In Fig. 2 we compare the performance of optimized GD and optimized NGD for a two-node student learning from a 
noiseless, isotropic teacher starting from the same initial conditions (K = 2, T nm = 5 nm ). Figs. 2(a), (b) and (c) show 
results for teachers with M = 1, M = 2 and M = 3 hidden nodes respectively. In each case the optimal learning rate 
schedule for NGD is shown by the inset. It should be noted that although there is a significant temporal variation 
in the optimized 77, very similar performance would be achieved by choosing 77 to be fixed at its average value. We 
see that NGD significantly outperforms GD in each example. For the over-realizable example shown in Fig. 2(a) the 
difference is most significant, with NGD displaying no obvious symmetric plateau. Performance of the NGD algorithm 
seems to reflect the difficulty inherent in the task, while GD displays very similar performance in each case. It is 
interesting to compare our results with those found using a locally optimal rule derived by variational arguments [15]. 
The variational approach requires rather detailed information about the teacher's structure and would be difficult to 
approximate with a practical algorithm. However, we find rather similar performance with NGD, especially for the 
K = M = 2 example shown both here and in [15]. The performance bottleneck for GD is due to an inherent symmetry 
in the student parametrization, while for NGD the task complexity seems to be more important. Also notice that 
the generalization error is significantly lower during the symmetric plateau for NGD in each case, which is due to 
reduced weight vector norms (this is also true for the locally optimal algorithm) . It is the growth of these norms which 
limits increases in the learning rate for GD and it appears that NGD is much more effective in controlling this effect. 
Another interesting difference between the NGD and GD dynamics is in the short transient prior to the symmetric 
phase. The NGD dynamic seems to converge much slower to the symmetric fixed point, as shown in Fig. 2, reflecting 
the fact that the strong eigenvalues, related to eigenvectors which lead the dynamic to the symmetric pixed point, 
are effectively reweighed and suppressed by the NGD rule. 



Although our equations of motion are sufficient to describe learning for arbitrary system size, the number of 
order parameters is \K(K + 1) + KM so that the numerical integration soon becomes rather cumbersome as K 
and M grow and analysis becomes difficult. To obtain generic results in terms of system size we therefore exploit 



GD): 




(11) 



V. GENERIC RESULTS FOR A SYMMETRIC SYSTEM 
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symmetries which appear in the dynamics for isotropic tasks and structurally matched student and teacher {K = M 
and T = TS nm ). This site-symmetric ansatz is only rigorously justified for the special case of symmetric initial 
conditions and further investigations are required to determine the validity of this approximation in general for large 
values of K (fixed points other than those considered here have been reported for GD [12] and it is unclear whether 
or not their basins of attraction are negligible). Simulations of the GD dynamics for K up to 10, with random initial 
conditions, show good correspondence with the symmetric system. In this case we define a four dimensional system 
via Qij = QSij + C(l — 5ij) and Ri n = R5i n + 5(1 — 5i n ) which can be used to study the dynamics for arbitrary K 
and T (here, Sij denotes the Kronecker delta). In appendix A we show how the Fisher information matrix can be 
inverted for the reduced dimensionality system and the resulting equations of motion are given in appendix B 1. 

Analytical study of the symmetric phase for GD is only feasible for small learning rates, since in this case the 
symmetric fixed point is easily determined and a linear expansion around this fixed point is possible [9,13]. Such an 
analysis is not feasible for NGD because the dynamics never approaches this fixed point (the Fisher information matrix 
becomes singular when Q = C). In any case, a small rj analysis will be of limited value since it is the fluctuation driven 
terms in the dynamics (terms proportional to rj 2 in Eq. (10)) which set the learning time-scale and determine the 
optimal and maximal learning rate during the symmetric phase. In order to study the performance of both methods 
for larger learning rates we will therefore apply a the optimal learning rate framework described in the preceding 
section [14]. 

The impact of output noise on the symmetric phase dynamics is not considered explicitly here. For low noise levels 
there is no noticeable effect on the length of the symmetric phase, or on the order parameters and generalization error 
within this phase. For larger noise levels the symmetric phase increases in length and the student norms increase, 
resulting in a larger generalization error. We expect that these are secondary effects and that most essential features 
of this phase are captured by the noiseless dynamics. This is not true for later stages of learning, where the inclusion 
of noise completely alters qualitative features of the dynamics. These asymptotic effects are considered in section VB 
below. 



A. Globally optimal performance 

The optimal learning rate is determined as describe before Eq. (11) in section IV. In the following examples we 
use a brief initial learning phase with GD (until a = 1) as this results in faster entry into the symmetric phase and 
also leads to quicker convergence of the learning rate optimization. The effect on learning time will be negligible as 
K becomes very large, but this procedure might be used to improve performance in practice for realistically sized 
networks. 

Fig. 3 summarises our results for transient learning in the absence of noise. In Fig. 3(a) we compare optimal 
performance for K = 10 and T = 1, which indicates a significant shortening of the symmetric phase for NGD (the 
inset shows the optimal learning rate for NGD). Fig. 3(b) shows the time required for NGD to reach a generalization 
error of 10~ 4 K as a function of K (for T — 1). The learning time is dominated by the symmetric phase, so that these 
results provide a scaling law for the length of the symmetric phase in terms of task complexity. We find that the 
escape time for NGD scales as K 2 , while the inset shows that the learning rate within the symmetric phase approaches 
a K~ 2 decay. Scaling laws for GD were determined in [13] (also using a site-symmetric ansatz), showing a K's law 
for escape time and a learning rate scaling of within the symmetric phase. The escape time for the adaptive 

learning rule studied in [13] scales as K?, which is also worse than NGD. 

B. Asymptotic convergence with noise 

After the symmetric phase, the order parameters begin convergence towards their asymptotic values (i?oo = Qoo = 
T, Soo — Coo = 0) and for the realizable scenario considered here the generalization error converges towards zero 
(recall that we have defined the generalization error to be the expected error in the absence of noise) . In the absence 
of output noise this convergence is exponential for a fixed learning rate so long as we do not choose the learning 
rate too high. However, in the presence of output noise the learning rate must be annealed in order to achieve zero 
generalization error asymptotically. It is known that NGD is asymptotically optimal, in terms of the covariances of 
the student-teacher weight deviations (the quadratic estimation error), with rj = 1 /a, saturating the Cramer-Rao 
bound and equalling in performance even the best batch methods [2]. However, the quadratic estimation error has 
no direct interpretation in terms of generalization ability. In Fig. 4 we show results for optimized NGD dynamics 
with K = 5 and T = 1. Fig. 4(a) shows the generalization error and Fig. 4(b) shows the corresponding optimal 
learning rate schedules for three noise levels (a 2 = 0.1, a 2 = 0.01 and a 2 = 0.001). The graphs are on log-log scales 
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and show that the optimized learning rates indeed converge to a 1/a decay after leaving the symmetric phase. The 
generalization error decays at the same rate, but with a prefactor which depends on the noise level. 

In order to determine the asymptotic generalization error decay analytically we apply recent results for the annealing 
dynamics of GD [16]. This allows a comparison between the asymptotic generalization error for NGD and the result 
for GD. In appendix B 2 we solve the asymptotic dynamics for annealed learning. As expected, the optimal annealing 
schedule for NGD is found by setting rj = 1/a at late times. By contrast, although the optimal learning rate for GD 
is also inversely proportional to a, the optimal prefactor depends on K and T in a non- trivial manner [16]. For both 
optimized GD and NGD the generalization error decays according to an inverse power law: 

€q(j 2 , „. 

e„ ~ as a — > oo . (12 

a 

The exact result for NGD takes a very simple form to = \K independent of the value of T. This equals the universal 
asymptotics for optimal maximum likelihood and Bayes estimators which depend only on the learning machine's 
number of degrees of freedom [17]. NGD is therefore asymptotically optimal in terms of both generalization error and 
quadratic estimation error. 

In Fig. 5 we compare the prefactor of the generalization error decay for NGD and optimal GD. Fig. 5(a) shows the 
result for T = 1 as a function of K, indicating an approximately linear scaling law for GD (The result above shows 
that the NGD scaling is linear in K). In Fig. 5(b) we compare the decay prefactors for each method as a function of 
T, showing how the difference diverges as T is reduced (the GD results are for large K). This can be explained by 
examining the asymptotic expression for the Fisher information matrix, shown in Eq. (A6). For large T the diagonals 
of this matrix are 0(1/ VT) and equal (for large N) while all other terms are at most 0(1/T), so that the Fisher 
information is effectively proportional to the identity matrix in this limit and NGD is asymptotically equivalent to 
GD. However, for small T the diagonals are 0(T 2 ) while the off-diagonals remain finite, so that the Fisher information 
is dominated by off-diagonals in this limit. 



VI. CONCLUSION 



We have used a statistical mechanics formalism to solve and analyse the dynamics of natural gradient descent (NGD) 
for learning in a two-layer feedforward neural network. In order to quantify the comparative performance of NGD and 
gradient descent (GD) we compared the optimized performance of each algorithm by determining the optimal learning 
rate in each case. We found that NGD provided significant gains in performance over GD in every case examined, 
both in the transient and asymptotic stages of learning. A site-symmetric ansatz was applied in order to simplify the 
dynamical equations for a realizable and isotropic task. This allowed the dynamics of large networks to be integrated 
efficiently so that we could determine generic behaviour for large networks. We found that the learning time scaled as 
K 2 where K is the number of hidden nodes, compared to a scaling of K~s for GD [13]. Asymptotically NGD is known 
to provide optimal performance with r\ = 1/a in terms of the quadratic estimation error. An asymptotic solution to 
the annealed learning rate dynamics showed this schedule to also be optimal in terms of generalization error, with 
the error decay saturating the universal asymptotics for optimal maximum likelihood and Bayes estimators [17]. We 
compared this result with the optimized schedule for GD and plotted the relative performance for various values of 
task-nonlinearity T. The difference in performance was found to be largest for small values of T. However, in the 
case of NGD the optimal annealing schedule at late times is known, while for GD it is a complex function of K and 
T which will be difficult to estimate in general. 

One possible drawback for NGD is the rather complex transient behavior of the optimal learning rate. For example, 
in the realizable isotropic case the optimal learning rate scales as K~ 2 in the symmetric phase and K^ 1 asymptotically 
in the absence of noise. It is also unclear where learning rate annealing should begin in the presence of output noise. 
Asymptotically the optimal annealing schedule is known, so the situation is better than for GD, but the problem of 
setting a good learning rate in the transient remains. In practical applications there will also be an increased cost 
required in estimating and inverting the Fisher information matrix [4]. Here, we have only considered the idealized 
situation in which the Fisher information matrix is exactly known. In [8] we adapt a matrix momentum algorithm 
due to Orr and Leen [7] in order to obtain efficient averaging and inversion of the Fisher information matrix on-line. 
This algorithm is shown to provide a good approximation to NGD, although this is at the cost of including an extra 
parameter. 
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APPENDIX A: INVERTING THE FISHER INFORMATION MATRIX 



In general the Fisher information matrix should be inverted using the block inversion method described in [4] . The 
parameters in Eq. (9) are then complicated functions of Q which must be determined iteratively (see [19] for a similar 
method applied to the Hessian matrix for M = K = 2). Below we consider the simpler situation of a site-symmetric 
system, in which case the inversion can be carried out in closed form for arbitrary K. Asymptotically the result is 
shown to be further simplified. 



1. Site-symmetric system 

Exploiting symmetries in the dynamics of realizable isotropic learning (K = M and T nm = T5 nm ) we consider a 
reduced dimensionality system with Qij = QSij + C(l — We can then write block (i, j) of the Fisher information 
matrix as (see Eq. (6)), 

Aij = (a (% + b) I + (c % + d) J t jJ + d J 3 Jj + e (J 4 jJ + JjJ?) , (Al) 

where we have defined, 

2 2 , 4(1 + Q-C) 4 

a = — o , c 



+ Q) 2 -C 2 ) ' 7rVT+2Q ^((1 + Qf-C 2 )i ^(1 + 2Q)« 

-2(1 + Q) 2C 

a — — 



k=\ 1=1 



^((1 + Q) 2 -C 2 )5 ^((1 + Q) 2 -C 2 )5 

Block in the inverse of A is then given by, 

K K 

^a" v a(a + bK) 
and symmetries suggests the following general form for T, 

+ ifeSjkSii + niSji + Sik) + Js(Sjk + hi) + 7o4( + 7io% + 7n ■ (A3) 

We therefore have to set 11 free parameters in order to fully specify T. This is achieved by substituting Eqs. (Al) 
and (A2) into the definition of the inverse, 

K 

J2 A ik A kj = SijI Vi,j . (A4) 
fe=i 

Equating like terms leads to a set of 15 equations and we can choose any linearly independent subset of 11 equations 
in order to determine 7. For one particular choice we find, 

7 = M" 1 ^ , (A5) 

where the non-zero terms in Mn x n = [rriij] and /3 lxll = [Pi] are defined below, 



7 



mis = m 2 ,2 = m 4 , 3 = (Q - C)(d + e) + cQ + a , 
TOl,2 = m 2 ,9 = m4. 8 = c [Q + C(K - 1)] , 
mi,3 = m 2i8 = Ti4,io = (Q - C)(dK + c) , 

1711.4 = 7711,6 = C (Q — C) , m 2 ,4 = m4, 6 = C C 

m 2 ,6 = 7774,4 = "^5,4 = m 5j6 = d(Q - C) , 
m 3 ,2 = m 6A = m 8 , 6 = mn, 8 = a , 

7773,3 = ?™6,6 = 7777,4 = 7777,6 = m 8 , 4 = mn, io = e(Q - C) , 

7775.1 = 7779,5 = & + dQ + eC , 

m5,2 = m 7 , 2 = mio.g = (d + e)(Q + C(K - 1)) , 

7715,3 = d(Q - C) + Kb + a , 7717,1 = mio, 2 = mio,3 = eQ + dC 

"77,3 = 7nio,io = e(Q - C) + cC + (d+ e)KC , 

7777.5 = 77710,7 = {Q- C)(d + c + e) + a + cC + K(Qe + Cd) , 
7777,7 = "710,11 = (Q- C)(K(d + e) + c)+cCK+(d+ e)CK 2 , 

7779. 2 = b , 7779,3 = 77710,4 = 777l ,6 = (<f + e)C , 

7779,7 = #6 + a + (d + e )(Q + C(if-l)) , 
777io,8 = (Q-C)(d + 2e) + cC + 2XC(d + e) , 

c b(c + gg) d d 

Pi — , P4 — —7 TTT7T , P5 — 

a a(a + oi() a a 



a ' o(a + &g) 

2. Asymptotic inversion 

For realizable rules the asymptotic form for each block of A is (to leading order), 

Aij = (a % + b) I + (c % + d) B.B^ + d BjB] , (A6) 



where, 



2 2,2 
a = , — , 



ttVT+2T tt(1 + T) ' tt(1 + T) ' 

4 4,2 



tt(1+T) 2 7r(l + 2T)l ' rr(l + T) 2 ' 

Block (i, j) in the inverse of A is then given by, 

A -" = (^«-^ra) I + ^ rr ' B " B "' <A7) 

\ \ / / 71=1 

Substituting these expressions into Eq. (A4) and using the orthogonality of the teacher weight vectors (T nm = T5 nm ) 
we obtain a matrix equation for T n — [r™ ], 

r™ = p x x (A8) 

where, 



P = al 



e„ \ T / cT —dT \ ( e„ 



u ) \ -dT b \ u 



x _ 1 ( e„ \ f -c (da- be) /(a + bK) \ ( e„ 
a^u j ^rf -dty(a + &g) J ^ u 
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Here, we have defined e„ to be a X-dimcnsional row vector with a one in the nth element and zeros everywhere else, 
while u is a row vector of ones. Solving for T n we find, 



T 

6n l I Gr, 



r " = K( u ) ®l U ) 



where, 

= 



/ K(d 2 T - be) - ac d 2 T - be - ad 

\ d 2 T-bc-ad (d 2 T(a + b)-2abd-b 2 c)/(a + bK) 



A = a 2 (a + bK) + a 2 T(c - 2d) + aT(K - \){cb - d 2 ) . 



APPENDIX B: EQUATIONS OF MOTION 

Using the definition of A -1 given in Eq. (9) we find, 
dR 



da 

dQir 

da 



j \ kl / 

= «E ( + Orjlpji + ^2^jl{&klQkr + &kiQki) J +V 2 ^2^J^rkV jk ■ (Bl) 

j \ kl I jk 



Here 4> ln = {5iy n ){£}, tpik = (SiXk}{£} and v ik = (<Wfc){£> where x t = Jj£ and y n = are activations of the ith 

student and nth teacher hidden nodes respectively and 5f = g'(x^) [X^^Li 9(Un) ~ SjLi 9( x j ) + P* 1 ] • The brackets 
denote averages over inputs which can be written as averages over the multivariate Gaussian distribution of student 
and teacher activations. The explicit expressions for <fri n , ipik, Vik depend exclusively on the weight overlaps (the 
covariances of the activation distribution) and are given in [9]. 



1. Site-symmetric system 



We substitute the definition of the inverse Fisher information for a symmetric system from Eq. (A2) into Eq. (3) 
to get the weight update equation: 



1 N 



(B2) 



jkl 



where s = 1/a and t = —b/a(a + bK). Differential equations for the order parameters can then be derived by the 
methods described in [9] and for the reduced dimensionality system we find, 



dR 

da 
dS_ 

da 
dQ 
da 

dC 
da 



s<j> s +t{<j) s + {K - l)<j> a ) +v R + w R + z R R 

S 4> a + t (4> s + (K - l)4> a ) +WR + Z R S 

s tp s + t + (K - l)Va) + 2(v Q +w Q + z Q Q) 



s 2 v s + (2st + t 2 K)(v s + {K- l)v a ) 



s^ a + t(iP s + (K - l)Va) + 2(w Q + z Q C) 



s 2 v a + {2st + t 2 K)(v s + {K- IK) 



(B3) 



where, 
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VR = (74+76)(#-S)(V> s -^a), 

WR = ( 7 4 + 7e) [(R - S)ljj a + S(i> s + (K- l)Va)] + 
(R+ (K — 1)5) [( 72 + 73 + ^77)^ + (278 + 79 + 7io + #711) + (A" - l)Va) 

«H = (71 + lbK)lp s + ( 72 + 73 + K l7 ){4> s + {K- l)Va) , 

and i>g , wq and zq are the same except that R and S are replaced by Q and C respectively everywhere they appear 
explicitly. Here, 7 = [7^] is defined in Eq. (A5) and we have defined, 

( S iVn){t} = $in4>s + (1 - kn)4>a , = kjV s + (1 - %)^a , 

(SiXj + SjXi)^ = Sijlps + (1 - Sij)lp a , 

where % with two indices denotes the Kronecker delta and brackets denote averages over the inputs. These averages 
can again be calculated in closed form [9]. 

2. Asymptotic dynamics 



The asymptotic dynamics for GD with an annealed learning rate have recently been solved under the statistical 
mechanics formalism and the optimal generalization error decay is known in this case [16]. Here we extend those 
results to NGD. 

Following the notation in [16] we define u = (R - T, Q - T, 5, C) T to be the deviat ion from the asymptotic fixed 
point. If the learning rate decays according to some power law then the linearized equations of motion around this 
fixed point are given by, 



du 

da 



77 Q Mu + ? ? 2 (J 2 b , 



(B4) 



where r\ a M is the Jacobian of the equations of motion to first order in r\ while the only non-vanishing second order 
terms are proportional to the noise variance. For T = 1 we find, 



M = 



Cl 
16 



c 3 



c 2 -6(K-l)c 3 3(K-l)c 3 \ 
c 4 -12(^-1)03 6(K-l)c 3 



c 5 
2c 4 



8V3 -4^3 
V 16V3 -8V3 

A = tt(25 - 16V3 + K{8V3 - 9)) , 
ci = 2(32^3 - 41 - K(16V3 - 9)) , 
c 2 = (57- 48V3 + if(24V3-9))/2 
c 3 = 2(2V3 - 6 + 3K) , 
c 4 = 7 - 16V3 + K{9 + 8V3) , 
c 5 = 16V3 - 43 + if (27 - 8V3) , 



-9(K- 1) 
"7^ 



and the two non-zero entries in b arc, 



b 2 
b 4 



tt(38V3 - 66 + K(5l - 30\/3) + if 2 (6\/3 - 9)) 

(2+(if-l)V3)(V3-2) 2 
-3rr(7 - 4V3 + K(2y/3 - 3)) 



(2+(K-l)V3)(V3-2)2 
The solution to Eq. (B4) with r\ a = i] /ot is, 

u(a) = <r 2 VXV _1 b 



(B5) 



where V 1 MV is a diagonal matrix whose entries Xi are eigenvalues of M. We have defined the diagonal matrix X 
to be, 



10 



rdiag 



1 + XiTjo 



a 



A;»?o Q ,7( 1 + A *>Jo) 



(B6) 



where annealing begins when a = a . For natural gradient learning we find two degenerate eigenvalues Ai ; 2 = — 1, 
A3, 4 = —2 and by substituting Eq. (B6) into a first order expansion of the generalization error it is straightforward 
to show rjo = 1 to be optimal. In this case the modes corresponding to A1.2 do not contribute to the asymptotic 
generalization error and for all values of T we find, 



2a(l + 77 A 3> 4) 



2a 



(B7) 
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FIG. 1. Numerical integration of the NGD equations of motion. A two-node soft committee machine learns from examples 
generated by a two-node isotropic teacher (K = M = 2, T nm = S nm ) in the absence of noise. The learning rate is fixed at 
r\ — 0.05 and initial conditions are Ri„,Qi^k G U[0, 10 ] and Qa 6 U[0, 0.5]. The generalization error is shown by the solid line 
in (a) with the exponential asymptotic decay shown on a log scale in the inset (the dashed line shows the effect of reducing the 
initial conditions by a factor of 10~ 3 ). The student-teacher and student-student overlaps are shown in (b) and (c) respectively. 




a a a 



FIG. 2. The optimal performance of a two node student (K = 2) for NGD (solid line) and GD (dashed line) for (a) 
over-realizable: M = 1, (b) realizable: M = 2 and (c) unrealizable: M = 3 learning scenarios. The optimal learning rate 
schedule for NGD is shown by the inset to each figure. The teacher is isotropic (T nm = S nm ), noise free and initial conditions 
are as in Fig. 1 
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a 
K 




FIG. 3. In (a) the generalization error is shown for optimal NGD (solid line) and optimal gradient descent (dashed line) for 
K — 10 in the site-symmetric system (we define a = 10~ 2 a). The inset shows the optimal learning rate for NGD. In (b) the 
time required for optimal NGD to reach a generalization error of W~ 4 K is shown as a function of A' on a log-log scale. The 
inset shows the optimal learning rate within the symmetric phase. In both (a) and (b) we used T — 1, zero noise and initial 
conditions R = 1CT 3 , Q = U[0, 0.5] and S = C = 0. A brief stage of GD is used before NGD is started. 
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FIG. 4. Optimal NGD with teacher corrupted by Gaussian noise for K = 5 and T = 1 in the site-symmetric system, shown 
on a log-log scale. The generalization error and optimal time-dependent learning rates are shown in (a) and (b) respectively, 
with cr 2 = 0.1 (solid line), a 2 = 0.01 (dashed line) and a 2 = 0.001 (dot-dashed line). 




FIG. 5. Prefactor for the asymptotic decay of the generalization error: (a) shows the prefactor for T — 1 as a function of K 
for optimal GD (circles) and NGD (crosses) while (b) shows how the prefactor for optimal GD (large K) decays towards K/2 
as T increases, which is the prefactor for NGD. The inset to (b) shows the GD result on a log-log scale. 
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